41 research outputs found

    Parsing MetaMap Files in Hadoop

    The UMLS::Association CUICollector module identifies UMLS Concept Unique Identifier (CUI) bigrams and their frequencies in a biomedical text corpus. CUICollector was re-implemented in Hadoop MapReduce to improve the algorithm's speed, flexibility, and scalability. Evaluation of the Hadoop implementation against the serial module produced equivalent results and achieved a 28x speedup on a single-node Hadoop system.
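
    The bigram counting described above maps naturally onto MapReduce's word-count pattern. The sketch below simulates that pattern locally in plain Python (the real system runs as Hadoop jobs over MetaMap output; the utterance format and CUI values here are illustrative):

```python
from itertools import groupby

def mapper(utterances):
    """Map phase: emit ((left CUI, right CUI), 1) for each adjacent CUI pair.

    `utterances` is assumed to be an iterable of CUI lists, one list per
    utterance, as parsed upstream from MetaMap output.
    """
    for cuis in utterances:
        for left, right in zip(cuis, cuis[1:]):
            yield (left, right), 1

def reducer(pairs):
    """Reduce phase: sum counts per bigram (Hadoop's shuffle sorts by key)."""
    for bigram, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield bigram, sum(count for _, count in group)

# Local simulation of one map-reduce round over two toy utterances.
utterances = [["C0003232", "C0011860", "C0003232"],
              ["C0003232", "C0011860"]]
counts = dict(reducer(mapper(utterances)))
```

    In a real deployment the mapper and reducer would run on separate nodes, with Hadoop handling the sort-and-shuffle between them; the single-process simulation above only illustrates the data flow.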

    Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation

    Evaluating Feature Extraction Methods for Biomedical WSD. Clint Cuffy, Sam Henry and Bridget McInnes, PhD, Virginia Commonwealth University, Richmond, Virginia, USA.

    Introduction: Biomedical text processing is a highly active research area, but ambiguity remains a barrier to processing and understanding these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with these vectors is noise: information that is overly general and does not contribute to the word's representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from a high-dimensional space to a space of fewer dimensions. Word embeddings [1] have become an increasingly popular method of reducing the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method.

    Methods: Context requiring disambiguation consists of an instance of an ambiguous word and multiple denotative senses. In our method, each word is replaced with its respective word embedding, and the embeddings are either summed or averaged to form a single instance vector representation. The same is done for each sense of an ambiguous word, using the sense's definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between each sense vector and the instance vector, and assign the instance the sense with the highest value.

    Evaluation: We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), vary the word vector lengths from one hundred to one thousand, and compare differences in accuracy.

    Results: The method achieves fairly high accuracy at disambiguating biomedical instance contexts among groups of denotative senses. The Skip-gram model obtained higher disambiguation accuracy than CBOW, but the increase was not significant for all of the datasets. Similarly, vector representations of differing lengths showed minimal change in results, often differing by only tenths of a percent. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that use word embeddings, and found comparable or higher disambiguation accuracy.

    Conclusion: Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings disambiguates biomedical text with high accuracy while reducing the associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data.

    [1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
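
    The disambiguation step described in the Methods can be sketched in a few lines: build an instance vector by summing word embeddings, do the same for each sense definition, and assign the sense with the highest cosine similarity. The toy 2-d embeddings and sense inventory below are invented for illustration; real vectors would come from word2vec trained on the Medline baseline:

```python
import math

def vec_sum(words, embeddings):
    """Sum the embeddings of the in-vocabulary words (OOV words are skipped)."""
    dims = len(next(iter(embeddings.values())))
    total = [0.0] * dims
    for w in words:
        if w in embeddings:
            total = [t + e for t, e in zip(total, embeddings[w])]
    return total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def disambiguate(instance_words, sense_definitions, embeddings):
    """Assign the sense whose definition vector is most similar to the instance."""
    instance_vec = vec_sum(instance_words, embeddings)
    return max(sense_definitions,
               key=lambda s: cosine(instance_vec,
                                    vec_sum(sense_definitions[s], embeddings)))

# Toy 2-d embeddings and a two-sense inventory (illustrative only).
emb = {"blood": [1.0, 0.0], "cell": [0.8, 0.2],
       "prison": [0.0, 1.0], "inmate": [0.1, 0.9]}
senses = {"C1": ["blood", "cell"],      # definition words for sense C1
          "C2": ["prison", "inmate"]}   # definition words for sense C2
sense = disambiguate(["blood"], senses, emb)
```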

    Using natural language processing techniques to inform research on nanotechnology

    Literature in the field of nanotechnology is increasing exponentially, with more and more engineered nanomaterials being created, characterized, and tested for performance and safety. With this deluge of published data, there is a need for natural language processing approaches to semi-automate the cataloguing of engineered nanomaterials and their associated physico-chemical properties, performance, exposure scenarios, and biological effects. In this paper, we review the different informatics methods that have been applied to patent mining, nanomaterial/device characterization, nanomedicine, and environmental risk assessment. Nine natural language processing (NLP)-based tools were identified: NanoPort, NanoMapper, TechPerceptor, a Text Mining Framework, a Nanodevice Analyzer, a Clinical Trial Document Classifier, Nanotoxicity Searcher, NanoSifter, and NEIMiner. We conclude with recommendations for sharing NLP-related tools through online repositories to broaden participation in nanoinformatics.

    Chrono: A System for Normalizing Temporal Expressions

    The Chrono System: Chrono is a hybrid rule-based and machine learning system, written in Python and built from the ground up, that identifies temporal expressions in text and normalizes them into the SCATE schema. Input text is preprocessed using Python's NLTK package and run through each of the four primary modules highlighted here. Note that Chrono does not remove stopwords, because they add temporal information and context, and it does not tokenize sentences. Output is an Anafora XML file with annotated SCATE entities. After minor parsing logic adjustments, Chrono emerged as the top-performing system for SemEval 2018 Task 6. Chrono is available on GitHub at https://github.com/AmyOlex/Chrono. Future Work: Chrono is still under development. Future improvements include: adding parsing for further entities such as "event"; evaluating the impact of sentence tokenization; implementing an ensemble ML module that utilizes all four ML methods for disambiguation; extracting the temporal phrase parsing algorithm into a stand-alone component and comparing it to similar systems; evaluating performance on the THYME medical corpus; and migrating to the UIMA framework with Ruta rules for portability and easier customization.
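
    As a rough illustration of the rule-based side of such a system (this is not Chrono's actual code), a minimal pass might scan for month names and four-digit years and emit SCATE-style entity records with character spans:

```python
import re

# Toy rule set in the spirit of Chrono's rule modules (illustrative only).
MONTHS = {"january": 1, "february": 2, "march": 3, "april": 4,
          "may": 5, "june": 6, "july": 7, "august": 8, "september": 9,
          "october": 10, "november": 11, "december": 12}

def extract_temporal(text):
    """Return SCATE-style entity dicts for month names and 4-digit years."""
    entities = []
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group(0).lower() in MONTHS:
            entities.append({"type": "Month-Of-Year", "span": m.span(),
                             "value": MONTHS[m.group(0).lower()]})
    for m in re.finditer(r"\b\d{4}\b", text):
        entities.append({"type": "Year", "span": m.span(),
                         "value": int(m.group(0))})
    return entities

ents = extract_temporal("The trial began in March 2018.")
```

    Chrono itself layers many more rules (days, times, relative expressions) plus ML disambiguation on top of this kind of span extraction, and serializes the entities to Anafora XML rather than Python dicts.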

    Development of a targeted and controlled nanoparticle delivery system for FoxO1 inhibitors

    Background: Poly(lactic-co-glycolic acid) (PLGA) and polyethylene glycol (PEG) are polymers approved by the United States Food and Drug Administration. Drugs for various medical treatments have been encapsulated in PLGA-PEG nanoparticles for targeted delivery and reduction of unwanted side effects. Methods: A flow synthesis method for PLGA-PEG nanoparticles containing FoxO1 inhibitors and adipose vasculature targeting agents was developed. A set of nanoparticles, including unloaded and drug-loaded PLGA and PLGA-PEG-P3, was generated. The particles were characterized by DLS, fluorescence spectroscopy, TEM, and dialysis. Endotoxin levels were measured using the LAL chromogenic assay. Our approach was compared to over 270 research articles using information extraction tools. Results: Nanoparticle hydrodynamic diameters ranged from 142.4 ±0.4 d.nm to 208.7 ±3.6 d.nm, while the polydispersity index was less than 0.500 for all samples (0.057 ±0.021 to 0.369 ±0.038). Zeta potentials were all negative, ranging from -4.33 mV to -13.4 mV. Stability testing confirmed that size remained unchanged for up to 4 weeks. For AS1842856, loading was 0.5 mg drug/mL solution and encapsulation efficiency was ~100%. Dialysis indicated burst release of the drug in the first 4 hours. Conclusion: PLGA encapsulation of AS1842856 was successful, but encapsulation of the two more hydrophilic drugs was not. Alternative syntheses, such as water/oil/water emulsion or liposomal encapsulation, are being considered. Analysis of data from published papers on PLGA nanoparticles indicated that our results were consistent with identified process-structure relationships, and that few groups reported endotoxin levels even though in vivo testing was performed.

    Machine Assisted Experimentation of Extrusion-Based Bioprinting Systems

    Optimization of extrusion-based bioprinting (EBB) parameters has been systematically conducted through experimentation. However, the process is time- and resource-intensive and not easily translatable to other laboratories. This study approaches EBB parameter optimization through machine learning (ML) models trained using data collected from the published literature. We investigated regression-based and classification-based ML models and their abilities to predict printing outcomes of cell viability and filament diameter for cell-containing alginate and gelatin composite bioinks. In addition, we interrogated whether regression-based models can predict a suitable extrusion pressure given the desired cell viability, keeping other experimental parameters constant. We also compared models trained across data from the general literature to models trained on data from one literature source that utilized alginate and gelatin bioinks. The results indicate that models trained on large amounts of data can capture the physical trends in cell viability, filament diameter, and extrusion pressure seen in past literature. Regression models trained on the larger dataset also predict cell viability closer to experimental values for material concentration combinations not seen in the training data of the single-paper-based regression models. While the best-performing classification models for cell viability achieve an average prediction accuracy of 70%, their cell viability predictions remained constant despite altered input parameter combinations. Our models trained on bioprinting literature data show the potential of applying ML models to bioprinting experimental design.

    Creation of an Annotated Library on FDA Approved Nanomedicines

    Nanomedicine is a type of nanotechnology used in the medical field to limit dosage amounts and target drug delivery to specific cells. Nanomedicines that are approved and used tend to be extremely successful; however, despite over a decade of research, only a limited number of nanomedicines have advanced to clinical use. A possible reason for the numerous nanomedicine failures is the lack of easily accessible information and research on previous nanomedicines. In this project, we have compiled nanomedicine labeling information from the Drugs@FDA website. We have extracted phrases/sentences from labels relating to keywords on nanomaterial properties and drug profile characteristics. In the future, we plan to incorporate discontinued nanomedicines, nanomedicines on the market, and nanomedicines in different clinical trial phases. By compiling the descriptions and contents of a set of specific nanomedicines, a machine learning program could be developed to comb through the literature and automatically identify similar nanomedicine-related entities. Our research works to provide an easier and quicker method of obtaining specific information on approved nanomedicines.
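
    The phrase/sentence extraction step can be sketched as a simple keyword filter over label text. The keyword list and label snippet below are illustrative, not the project's actual curation criteria:

```python
import re

# Hypothetical keyword list for nanomaterial properties; the label text is a
# paraphrased snippet, not an actual Drugs@FDA label.
KEYWORDS = {"liposome", "nanoparticle", "particle size", "pegylated"}

def extract_sentences(label_text, keywords=KEYWORDS):
    """Return the label sentences that mention any keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", label_text)
    return [s for s in sentences
            if any(k in s.lower() for k in keywords)]

label = ("DOXIL is doxorubicin HCl encapsulated in a liposome. "
         "Store refrigerated. The mean particle size is approximately 100 nm.")
hits = extract_sentences(label)
```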

    Vector Representations of Multi-Word Terms for Semantic Relatedness

    Vector Representations of Multi-Word Terms for Semantic Relatedness. Sam Henry, Clint Cuffy and Bridget T. McInnes, PhD.

    Introduction: Semantic similarity and relatedness measures quantify the degree to which two concepts are similar (e.g. liver-organ) or related (e.g. headache-aspirin). These metrics are critical to improving many natural language processing tasks involving retrieval and clustering of biomedical and clinical documents, and to developing biomedical terminologies and ontologies. Numerous ways exist to quantify these measures between distributional context vectors, but there has been no direct comparison between these metrics, nor an exploration of how to represent multi-word terms as context vectors. We explore several aggregation methods for building multi-word distributional context vectors for the task of semantic similarity and relatedness in the biomedical domain.

    Methods: We use two multi-word aggregation methods: summation and averaging of the component word vectors. We also generate single vector representations for multi-word terms directly, using our compoundify tool and by creating concept vectors with the MetaMap tool. Alongside these methods, we employ three vector dimensionality reduction techniques: singular value decomposition (SVD) and word embeddings using word2vec's continuous bag of words (CBOW) and skip-gram (SG) approaches. Explicit vectors of word-to-word, term-to-term, or component-to-component co-occurrences are used as a baseline. Finally, we measure differences across vector dimensionalities of 100, 200, 500, 1000, 1500 and 3000.

    Results: We evaluate the metrics on the UMNSRS and MiniMayoSRS reference standards. The lower-dimensional word2vec vectors (CBOW and SG) with a dimensionality of 200 outperform the explicit and SVD representations; SVD performs best at a dimensionality of 1000. Among the multi-word aggregation methods, the choice was essentially arbitrary: combining single terms into multi-word terms before or after training showed little statistically significant difference across all dimensionality reduction techniques and vector dimensionalities.

    Conclusions: In general, there is no increase in correlation between word2vec's SG versus CBOW in the biomedical context. Using the sum or mean of context vectors to create a single representation for multi-word terms achieves relatively high accuracy with little computational complexity. Although the methods for generating distributional context vectors differ, each has strengths and weaknesses depending on the hyper-parameters used.
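
    A small worked example clarifies why the choice between summing and averaging component vectors can be arbitrary: cosine similarity is scale-invariant in each argument, and the mean vector is just the sum divided by the term's word count, so the two aggregations yield identical similarities. The toy 3-d embeddings below are invented for illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def aggregate(words, emb, how="sum"):
    """Build one vector for a multi-word term by summing or averaging."""
    vecs = [emb[w] for w in words if w in emb]
    total = [sum(components) for components in zip(*vecs)]
    return total if how == "sum" else [c / len(vecs) for c in total]

emb = {"myocardial": [0.9, 0.1, 0.3], "infarction": [0.7, 0.2, 0.1],
       "heart": [0.8, 0.1, 0.2], "attack": [0.5, 0.4, 0.1]}

a, b = ["myocardial", "infarction"], ["heart", "attack"]
sim_sum = cosine(aggregate(a, emb, "sum"), aggregate(b, emb, "sum"))
sim_mean = cosine(aggregate(a, emb, "mean"), aggregate(b, emb, "mean"))
# Both similarities are equal: cosine ignores the 1/n scaling of the mean.
```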

    Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

    Background: Evaluation of Word Sense Disambiguation (WSD) methods in the biomedical domain is difficult because the available resources are either too small or too focused on specific types of entities (e.g. diseases or genes). We present a method that can be used to automatically develop a WSD test collection using the Unified Medical Language System (UMLS) Metathesaurus and the manual MeSH indexing of MEDLINE. We demonstrate the use of this method by developing such a data set, called MSH WSD.

    Methods: In our method, the Metathesaurus is first screened to identify ambiguous terms whose possible senses consist of two or more MeSH headings. We then use each ambiguous term and its corresponding MeSH headings to extract MEDLINE citations where the term and only one of the MeSH headings co-occur. The term found in the MEDLINE citation is automatically assigned the UMLS Concept Unique Identifier (CUI) linked to that MeSH heading. We compare the characteristics of the MSH WSD data set to the previously existing NLM WSD data set.

    Results: The resulting MSH WSD data set consists of 106 ambiguous abbreviations, 88 ambiguous terms and 9 that are a combination of both, for a total of 203 ambiguous entities. For each ambiguous term/abbreviation, the data set contains a maximum of 100 instances per sense obtained from MEDLINE. We evaluated the reliability of the MSH WSD data set using existing knowledge-based methods and compared their performance to the results these algorithms previously obtained on the pre-existing NLM WSD data set. The knowledge-based methods achieve different results but keep their relative performance, except for the Journal Descriptor Indexing (JDI) method, whose performance falls below the other methods.

    Conclusions: The MSH WSD data set allows the evaluation of WSD algorithms in the biomedical domain. Compared to previously existing data sets, MSH WSD contains a larger number of biomedical terms/abbreviations and covers the largest set of UMLS Semantic Types. Furthermore, the MSH WSD data set has been generated automatically by reusing already existing annotations and can therefore be regenerated from subsequent UMLS versions.
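
    The citation-filtering step described in the Methods can be sketched as follows: keep a citation for an ambiguous term only when exactly one of the term's candidate MeSH headings is indexed for it, then label the instance with the CUI linked to that heading. The data structures and example values below are illustrative:

```python
def label_instances(term, sense_headings, citations):
    """Keep citations where `term` occurs and exactly one candidate MeSH
    heading is indexed; label each kept instance with that heading's CUI.

    `sense_headings` maps MeSH heading -> CUI; `citations` holds
    (citation text, set of indexed MeSH headings) pairs.
    """
    instances = []
    for text, headings in citations:
        if term.lower() not in text.lower():
            continue
        matched = [h for h in sense_headings if h in headings]
        if len(matched) == 1:               # unambiguous co-occurrence
            instances.append((text, sense_headings[matched[0]]))
    return instances

senses = {"Cold Temperature": "C0009264", "Common Cold": "C0009443"}
cites = [("Cold exposure reduced core body temperature.", {"Cold Temperature"}),
         ("Cold symptoms resolved within a week.", {"Common Cold"}),
         ("Cold weather and cold viruses.", {"Cold Temperature", "Common Cold"})]
data = label_instances("cold", senses, cites)
```

    The third citation is discarded because both candidate headings are indexed, so the sense cannot be assigned unambiguously; this is what makes the resulting labels reliable without manual annotation.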